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Abstract 

This paper considers the problem of learning, from samples, the de- 
pendency structure of a system of linear stochastic differential equations, 
when some of the variables are latent. In particular, we observe the time 
evolution of some variables, and never observe other variables; from this, 
we would like to find the dependency structure between the observed vari- 
ables - separating out the spurious interactions caused by the (marginal- 
izing out of the) latent variables' time series. We develop a new method, 
based on convex optimization, to do so in the case when the number of 
latent variables is smaller than the number of observed ones. For the case 
when the dependency structure between the observed variables is sparse, 
we theoretically establish a high-dimensional scaling result for structure 
recovery. We verify our theoretical result with both synthetic and real 
data (from the stock market). 



1 Introduction 

Linear stochastic dynamical systems are classic processes that are widely used to 
model time series data in a huge number of fields: financial data [H], biological 
networks of species [28] or genes [2] , chemical reactions [19l [21] , control systems 
with noise [44) . etc. An important task in several of these domains is learning the 
model from data filP ; doing so is often the first step in both data interpretation, 
and making predictions of future values or the effect of perturbations. Often 
one is interested in learning the dependency structure |25| : i.e. identifying, for 
each variable, which set of other variables it directly interacts with. For stock 
market data, for example, this can reveal which other stocks most directly affect 
a given stock. 

We consider model structure learning in a particularly challenging yet widely 
prevalent setting: where (the time series of) some state variables are observed, 
and others are unobserved/latent. We are interested in learning the dependency 
structure between the observed variables. However, the presence of latent time 



series, if not properly accounted for by the model learning procedure, will result 
in the appearance of spurious interactions between observed variables - two 
observed variables that interact with the same unobserved variable may now be 
reported to be interacting. This happens, for example, if one uses the classic 
maximum-likelihood estimator jl6j. and persists even if we have observations 
over a long time horizon. 

Suppose, for illustration, that we are interested in learning the dependency 
structure between the prices of a set of stocks via a linear stochastic model. 
Clearly, stock prices depend not only on each other, but are also jointly in- 
fluenced by several variables that may not be part of our model, for example, 
currency markets, commodity prices etc.; these are latent time series. Their 
presence means that a naive structure learning algorithm (say max-likelihood) 
that takes as input only the stock prices, will report several spurious interac- 
tions; say, e.g. between all stocks that fluctuate with the price of oil. 

Our work involves several significant differences from the large body of work 
on sparse recovery and graphical model learning. One is the fact that our 
samples are dependent on each other, with the degree of dependence governed 
by how finely the system is sampled. Another is the presence of latent variables. 
We put our work in context with related literature in Section [2] 

Clearly there are several issues with regards to fundamental identifiability, 
and sample and computational complexity, that need to be defined and resolved. 
We do so below in the specific context of our model setting. We provide both 
theoretical characterization and guarantees of the problem, as well as numerical 
illustrations for both synthetic data and some real data extracted from stock 
market. 

The rest of the paper is organized as follows: We review the related literature 
in Section[2j We present the main idea and our algorithm in Section[3j Section|4] 
reveals our main result followed by the proof in Section[5] The simulation results 
are included in Section [6l 

2 Related Work 

We organize the most directly related work as follows (recognizing of course 
that these descriptions overlap). 

Sparse Recovery and Gaussian Graphical Model Selection: It is 

now well recognized [221 IHl that a sparse vector can be tractably recovered 
from a small number of linear measurements; and also that these techniques 
can be applied to do model selection (i.e. inferring the Markov graph struc- 
ture and parameters) in Gaussian graphical models [33l |34l [13l [171 ES] ■ While 
ours problem is, in a sense, also one of sparse linear model selection, two differ- 
ences between our setting and these papers is that they do not have any latent 
factors, and theoretical guarantees typically require independent samples. The 
simultaneous presence of both these characteristics is what makes ours a chal- 
lenging setting. In particular, latent factors imply that these techniques will in 
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effect attempt to find models that are dense, and hence not be able to have a 
high-dimensional scaling. Correlation among samples means we cannot directly 
use standard concentration results, and also brings in the interesting issue of 
the effect of sampling frequency; in particular, in our setting one can get more 
samples by finer sampling, but increased correlation means these do not result 
in better consistency. 

Sparse plus Low-Rank Matrix Decomposition: Our results are based 
on the possibility of separating a low-rank matrix from a sparse one, given their 
sum (either the entire matrix, or randomly sub-sampled elements thereof) - see 
[HI m HH SSI E] for some recent results, as well as its applications in graph clus- 
tering 1211 [13], collaborative filtering [37], image coding [521, etc. Our setting 
is different because we observe correlated linear functions of the sum matrix, 
and furthermore these linear functions are generated by the stochastic linear 
dynamical system described by the matrix itself. Another difference is that sev- 
eral of these papers focus on recovery of the low-rank component, while we focus 
on the sparse one. These two objectives have a very different high-dimensional 
scaling in our linear observation setting. 

Inference with Latent Factors: In real applications of data driven infer- 
ence, it is always a concern that whether or not there exist influential factors 
that have never been observed [3TJ |33]. Several approaches to this problem 
are based on Expectation Maximization (EM) (TU [3S]; while this provides a 
natural and potentially general method, it suffers from the fact that it can get 
stuck in local optima (and hence is sensitive to initialization), and that it comes 
with weak theoretical guarantees. The paper ^ takes an alternative, convex 
optimization approach to the latent factor problem in Gaussian graphical mod- 
els, and is of direct relevance to our paper. In [5], the objective is to find the 
number of latent factors in a Gaussian graphical model, given iid samples from 
the distribution of observed variables; they also use sparse and low-rank matrix 
decomposition. Differences between our paper and theirs is that we focus on 
recovering the support of the "sparse part" , i.e. the interactions between the 
observed variables exactly, while they focus on recovery the rank of the low- 
rank part (i.e. the number of latent variables). Our objective requires O(logp) 
samples, theirs requires ^{p). Another major difference is that our observations 
are correlated, and hence sample complexity itself needs a different definition 
(viz. it is no more the number of samples, but rather the overall time horizon 
over which the linear system is observed). 

System Identification: Linear dynamical system identification is a central 
problem in Control Theory |30j . There is a long line of work on this problem 
in that field including recent regularized convex optimization based approaches 
[T5] . Recently, [3] considered the system identification problem as learning de- 
pendence graph of time series, without any latent variables. They implement the 
LASSO; the main contribution is characterizing sample complexity in the pres- 
ence of sample dependence. In our setting, with latent variables, their method 
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returns several spurious graph edges caused by marginalization of latent vari- 
ables. 



Time-series Forecasting: Motivated by finance applications, time-series 
forecasting has got a lot of attention during the past three decades [TU] . In the 
model based approaches, it is assumed that the time-series evolves according 
to some statistical model such as linear regression model [J], transfer function 
model [5], vector autoregressive model [42], etc. In each case, researchers have 
developed different methods to learn the parameters of the model for the purpose 
of forecasting. In this paper, we focus on linear stochastic dynamical systems 
that are an instance of vector autoregressive models. Previous work toward 
estimating this model parameters include ad-hoc use of neural network [T] or 
support vector machine method [26], all without providing theoretical guaran- 
tees on the performance of the algorithm. Our work is different from these 
results because although our method provides better prediction comparing to 
similar algorithm, our main focus is sparse model selection not prediction. Per- 
haps, once a sparse model is selected, one can study the prediction quality as a 
separate subject. 



3 Problem Setting and Main Idea 

This paper considers the problem of structure learning in linear stochastic dy- 
namical systems, in a setting where only a subset of the time series are observed, 
and others are unobserved/latent. In particular, we consider a system with state 
vectors x{t) £ MP and u{t) G M.^ , for t e M+ and dynamics described by 
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where, w{t) G MP^^ is an independent standard Brownian motion vector and 
A* , B* , C* , D* are system parameters. 

Task: We observe the process x{t) for some time horizon < t < T, but not 
the process u{-). We are interested in learning the matrix A*, which captures 
the interactions between the observed variables. 

We will also be interested in a similar objective for an analogous discrete 
time system with parameter < 77 < ta*) ■ 
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for all n G Nq. Here, 'w{n) is a zero-mean Gaussian noise vector with covariance 
matrix T]I(p+r)x{p+r)- The prameter 77 can be thought of as the sampling step; 
in particular notice that as 77 — ?> 0, we recover model ^ from model The 
upper bound on 77 ensures the stability of the discrete time system as required by 
our theorem. Intuitively, crniax(-4*) corresponds to the fastest convergence rate 
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in the system and the upper bound on 77 corresponds to the Nyquist minimum 
samphng rate required for the reconstruction of the signal. As done in j3] , our 
proofs will initially focus on the discrete case ([2]), and derive results for ([T]) 
afterwards. 

(Al) Stable Overall System: We only consider stable systems. In fact, 
we impose an assumption slightly stronger than the stability on the overall 
system. For the continuous system Q, we require D := — A,nax( '^ — ) > 0. 
With slightly abuse of notation, for the discrete system ([2|, we require D := 

i^^^™ > 0, where, Emax := (JmaM + V-^*)- ■ 

As a consequence of this assumption, by Lyapunov theory, the continuous 
system ([T]) has a unique stationary measure which is a zero-mean Gaussian 
distribution with positive definite (otherwise, it is not unique) covariance matrix 
Q* (= R(P+r)x(p+r) giygj^ ^Yie solution of A*Q* + Q*A*'^ + / = 0. Similarly, 
for the discrete time system ([2]), we have A*Q* + Q*A*'^ + ■qA*Q*A*'^ + 1 = 
0. This matrix Q* has the form Q* = [Q* R*'^ ; R* P*], where, Q* and P* 
are the steady-state covariance matrices of the observed and latent variables, 
respectively, and R* is the steady-state cross-covariance between observed and 
latent variables. By stability, Cmin ■— Ainin(Q*) > and I?max ■— ^max{Q*) < 
00. 

Identifiability: Clearly, the above objective of identifying A* is in general 
impossible without some additional assumptions on the model; in particular, 
several different choices of the overall model (including different choices of A* ) 
can result in the same effective model for the x{-) process. x{-) would then 
be statistically identical under both models, and correct identification would 
not be possible even over an infinite time horizon. Additionally, it would in 
general be impossible to achieve identification if the number of latent variables 
is comparable to or exceeds the number of observed variables. Thus, to make 
the problem well-defined, we need to restrict (via appropriate assumptions) the 
set of models of interest. 

3.1 Main Idea 

Consider the discrete-time system ^ in steady state and suppose, for a moment, 
that we ignored the fact that there may be latent time series; in this case, we 
would be back in the classical setting, for which the (population version of) the 
likelihood is 

\\x{i + 1) — x{i) ~ rjAx{i)\\2 ■ 

Lemma 1. For x(-) generated by ([2|, the the optimum A := max^£(A) is 
given by 

A ^ A* + B*R*{Q*)-^. 

Thus, the optimal A is a sum of the original A* (which we want to recover) 
and the matrix B* R* (Q*)^^ that captures the spurious interactions obtained 
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due to the latent time series. Notice that the matrix B* R* {Q*)~^ has the rank 
at most equal to number r of latent time series. We will assume that the number 
of latent time series is smaller than the number of observed ones - i.e. r < p - 
and hence B* R* {Q*)~^ is a low-rank matrix. 



Besides identifying the effect of the latent time series, we would need the true 
model to be such that A* is uniquely identifiable from B*R*{Q*)~^. We choose 
to study models that have a local-global structure where (a) each of the observed 
time series Xi{t) interacts with only a few other observed series, while (b) each 
of the latent series interacts with a (relatively) large number of observed series. 
In the stock market example, for instance, this would model the case where the 
latent series corresponds to macro-economic factors, like currencies or the price 
of oil, that affect a lot of stock prices. 

In particular, let s be the maximum number of non-zero entries in any row 
or column of A* ; it is the maximum number of other observed variables any 
given observed variable directly interacts with. Note that this means A* is a 
sparse matrix. Let L* := B*R*{Q*)-^ and assume it has SVD L* = U*Y.*V*^ , 
and recall that its rank is r. Then, following L* is said to be ^-incoherent 
if /i > is the smallest real number satisfying 



where, e.^'s are standard basis vectors and || • || is vector 2-norm. Smaller values 
of jjL mean the row/column spaces make larger angles with the standard bases, 
and hence the resulting matrix is more dense. 

(A2) Identifiability: We require that the s of the sparse matrix A* and 

the /i of the low-rank L*, which has rank r, satisfy a :— 3^/— < 1. ■ 



Note that here we provide deterministic worst-case conditions on the sparse 
matrix. As shown in [11] [7] , better scaling is possible if we pursue probabilistic 
guarantees. 

3.3 Algorithm 

Recall that our task is to recover the matrix A* given observations of the x{-) 
process. We saw that the max-likelihood estimate (in the population case) was 
the sum of A* and a low-rank matrix; we subsequently assumed that A* is 
sparse. It is natural to use the max-likelihood as the loss function for the sum 
of a sparse and low-rank matrix, and separate appropriate regularizers for each 
of the components. Thus, for the continuous-time system observed up to time 



3.2 Identifiability 
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T, we propose solving 
(l,L)=argmin ^ j\{A + L)x{t)\\l dt -^j\[tf [A + if dx{t) 

+ \a\\A\\i + \l\\L\\.„ 
and for the discrete-time system given n samples, we propose solving 
1 

(4,L) = argmin— — V||a;(i + l)-a;(i)-J7(A + L)a;(i)||2+ Aa||A||i + Al||L||,. 
A.L zri'^n ^ — ' 

' 1=0 

(4) 

Here || • ||i is the £i norm (a convex surrogate for sparsity), and || • ||* is the 
nuclear norm (i.e. sum of singluar values, a convex surrogate for low-rank). The 
optimum yl of Q or ^ is our estimate of A* , and our main result provides 
conditions under which we recover the support of A*, as well as a bound on 
the error in values || A — A* ||oo (maximum absolute value). We provide a bound 
on the error \\L — L*\\2 (spectral norm) for the low-rank part. Notice that the 
discrete objective function goes to the continuous one as 77 — 0. 

3.4 High-dimensional setting 

Note that when A* is a sparse matrix, the actual degrees of freedom between 
the observed variables is smaller than that evinced by the ambient dimension 
p. Indeed, we will be interested in recovering A* with a number of samples n 
that is potentially much smaller than p (for small s). In the special case when 
we are in steady state and L — (i.e. Xl large) the recovery of each row of A* 
is akin to a LASSO [39^ problem (of sparse vector recovery from noisy linear 
measurements) with Q* being the covariance of the design matrix. We thus 
require Q* to satisfy incoherence conditions that are akin to those in LASSO 
(see e.g. ^] for the necessity of such conditions). 

(A3) Incoherence: To control the effect of the irrelevant (not latent) 
variables on the set of relevant variables, we require 

e := 1 - max HQ* lloo4> 0, 

where, Sk is the support of the k*^ row of A* and is the complement of that. 
The norm || • ||oo.i is the maximum of the i'l-norm of the rows. I 



4 Main Results 

In this section, we present our main result for both Continuous and Discrete 
time systems. We start by imposing some assumptions on the regularizers and 
the sample complexity. 

(A4) Regularizers: We need to impose some assumptions on the regular- 
izers to be able to guarantee our result. Let 

m = max f^\\B*\Ui, ^MOm +h(0)|li +{Vv + l)A , 
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be the constant capturing the effect of initial condition and latent variables 
through matrix B* . We impose the following assumptions on the regularizers: 

(A4-1) ^A=^-^^J'<=^. 
(A4-2) ^ = ^ + + 

Note: In practice, we let Xa = c-\/log (4((s + 2r)p + r'^)/5) jnr] and \l = 
d^V^A-, with the constants c,d chosen by cross-validation over prediction per- 
formance. 

(A5) Sample Complexity: In our setting, samples are dependent; in par- 
ticular, the smaller the r\ the more dependent two subsequent samples. Sample 
complexity is thus governed by the total time horizon rjn = T over which we 
observe the system, and not simply n; indeed finer sampling (i.e. smaller 77) 
requires a larger number of samples. For a probability of failure 5, we require 

/4((s + 2r> + r2)\ 
T = nv>^^Jo,(^ S )■ 

Here, if is a constant independent of any other system parameter; for example, 
K >3xl0^ sufHces. 

The above T is required to ensure that the empirical covariance matrix is 
close to the steady-state Q*,R*. Of course the constraint rj < 2/amax{-A*) 
ensures that the sampling intervals cannot be too large. Note that T is the 
total time over which the system is observed; a finer sampling cannot yield a 
smaller horizon, because of increased dependence between samples. 

Let - := 2^ + i-M) ^0 ■■= '^'^ (f ' 5ra«^) • The follow- 
ing (unified) theorem states our main result for both discrete and continuous 
time systems. 

Theorem 1. If assumptions (Al)-(A5) are satisfied, then with probability 1 — S, 
our algorithm outputs a pair {A, L) satisfying 

(a) Subset Support Recovery: Supp{A) c Supp{A*). 

(b) Error Bounds: 

||^-^*||oo < i^Aa and ||£-i*||2 < — ^||L*||2. 

1 - 5po 

(c) Exact Signed Support Recovery: // additionally we have that the small- 
est magnitude Amin of a non-zero element of A* satisfies Amin > i'Aa, then we 
obtain full signed- support recovery Sign{A) = Sign{A*). 

Note: Note that A^, as defined in (A4-1), depends on the sample complex- 
ity T, and goes to as T becomes large. Thus it is possible to get exact signed 
support recovery by making T large. 
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Remark 1: Our result shows that, in sparse and low-rank decomposition 
for latent variable modeling, recovery of only the sparse component seems to be 
possible with much fewer samples - 0{s^ logp) - as compared to, for example, 
the recovery of the exact rank of the low-rank part; the latter was show to 
require Q{p) samples in 

Remark 2: The above theorem shows that, even in the presence of latent 
variables, our algorithm requires a similar number of samples (i.e. upto universal 
constants) as previous work required in the absence of hidden variables. Of 
course, this is true as long as identifiability (A2) holds. Note that the absence 
of such identifiability conditions makes even simple sparse and low-rank matrix 
decomposition [S] ill-posed. Note also that the quantity po, which characterizes 
the error in the low-rank term, goes to as T increases (which decreases Xa)- 

Remark 3: Although our theoretical result shows a scaling proportional 
s'^ for the sample complexity, the theoretical result suggests that the correct 
scaling factor is s^. We suspect our result as well as Bento et al. [3], can be 
tightened and we are currently working on that. 

Illustrative Example: Consider a simple idealized example that helps give 
intuition about the above theorem. Suppose that we are in the continuous time 
setting, where each latent variable j depends only on its own past, updating 
according to ^ = —Xj{t) + '^^ and for each observed variable i depends only on 
its own past and a unique latent variable j(i), i.e., ^ = —Xi{t) -|-Xj(j)(i) -I- 
There are r latent variables, and assume that each latent variable affects exactly 
- observed variables in this way. 

In terms of the matrix A*, the overall (observed -I- latent) system has the 
form given by the matrix below 







c* 





Here A* — — /pxp, C* = 0, and D* = —Irxr- This matrix satisfies stability 
assumption (Al). In the matrix B* , each column has exactly ^ entries that 
are 1, and the remaining are 0. Each row of B* has exactly one entry that is 1, 
and the remaining are 0; note that the columns of B are orthogonal. We start 
from zero initial condition with 77 = (continuous time system). With this, 
i? = 2and ||B*|l^,i = l. 

For this idealized setting, we can exactly evaluate all the quantities we need. 
In particular, it is not hard to show (done in Appendix) that the steady-state 
covariance matrices are Q* — -^{I + BB^) and R* ~ B*^ . The resulting low- 
rank matrix is L* = -^BB^, which gives U = V — ^ -B\ the incoherence 
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parameter fi = r, and hence we need r < i/p/3 by assumption (A2). Moreover, 
we can show that 6* = ^ for this example and hence the assumption (A3) is also 
satisfied. 

Similarly, evaluating the other parameters in Theorem [T] we get that the 
observation time should be T > Ks^ log '^i'^+'^^)p+^^ for structure recovery with 
probability greater than S. In this case, we also have z/ = + and 



Po 



providing the error bounds ||A*-i||oo<(4^ + and 



\\L*^Lh<^X^. 

5 Proof of the Theorem 

In this section, we first introduce some notations and definitions and then, 
provide a three step proof technique to prove the main theorem for the discrete 
time system. The proof of the continuous time system is done via a coupling 
argument in the appendix. 

Before we proceed to the details, we would like to make a high level technical 
remark on the novelties of our proof. There are two key novel ingredients in the 
proof enabling us to get the low sample complexity result in our theorem. The 
first ingredient comes from our new set of optimality conditions inspired by [7] . 
This optimality conditions enable us to certify an approximation of L* while 
certifying the exact sign support of A*. The second ingredient comes from the 
bounds on the Schur complement of the perturbation of positive semi-definite 
matrices [3S]. This result enables us to get a bound on the Schur complement 
of a perturbation of a positive semi-definite matrix of size p with only log(p) 
samples. 

Given a matrix A*, let O be the subspace of matrices whose their support 
is a subset of the matrix A* . The orthogonal projection of a matrix M to fl 
is denoted by VniM). Denote the orthogonal complement space with V.'^ with 
orthogonal projection Pqo(M). 

For any matrix L G if the SVD is L = Ui:V^ , then let r(L) := 

{M|M = UX^ + YV^ioT some X, Y} denote the subspace spanned by all ma- 
trices that have the same column space or row space as L. The orthogonal 
projection of a matrix iV to T is denoted by Pf{N). Denote the orthogonal 
complement space with T*^ with orthogonal projection Vt"- We define a metric 
to measure the closeness of two subspaces 7i and T2 as follows 

P Ti,r2 = max — ^. 

Finally, let T = T{L*) to shorten the notation and L* = U*'S*V* be a singular 
value decomposition. 

The proof steps are as follows: 

• STEP 1: We construct a candidate primal optimal solution {A,L) with 
the desired sparsity pattern using the restricted support optimization 
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problem. We refer to this as oracle problem 

^ n — l 

(A,Z)=arg min ——y^\\x{i + l)-x{i)~rj{A + L)x{{}\ 



) Z;// It ^ 

A:-Pfjc(A) = 

+ Xa\\A\\^ + Xl\\L\\,. 



This oracle is similar to the one used in [8]. Note that this is a proof 
technique, not a method to construct the solution. 



• STEP 2: We Write down a novel set of sufficient (stationary) optimal- 
ity conditions for {A, L) to be the unique solution of the (unrestricted) 
optimization problem Q: 

Lemma 2. IfQnT = {0}, then (A,L), the solution to the oracle prob- 
lem ([5|, is the unique solution of the problem Q if there exists a matrix 
Z G RP^P such that 

(CI) Pn{Z) = XASignfl). (C2) [[^^^(Z)!! < Xa- 

\ / II I I OO 

iC3) WVriZ) ~ XlVV^W <4:pXl. (C4) l|prc(Z)l| <(1-q)Al. 
II II2 II II2 

(C5) - — X] (^(^ + 1) - a;(«) - V{A + L)x{i)^ x{if +Z = 0. 



Upon existence of the solution of the oracle problem not only is the 
solution to the original problem Q, but also satisfies the claim of the 
theorem. 

Lemma 3. Provided Z in Lemma^^ exists, we have 

(a) Supp{A) C Supp{A*). 

(b) \\A-A*\\^ < v\a and ||Z-L*||2 < i^ll^lb 

(c) If A„iin > v\a then Sign{A) = Sign{A*). 

Part (a) is immediate by the constraints of the oracle problem and pro- 
vided the £00 bound in (b) and part (a), the result of part (c) naturally 
follows. We prove part (b) in the Appendix|D] Now, it suffices to construct 
a dual variable Z. 
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• STEP 3: Constructing a dual variable Z that satisfies the sufficient opti- 
mality conditions stated in Lemma [2] First notice that under assumption 
(A2), we have n T = {0} HI] . For matrices M e and iV e T, let 

Um^M- Pt{M) + PnPriM) - TrVnTriM) + ... 
Gn^ N - VniN) + TrVniN) - TnTrVniN) + .... 

It has been shown in [Tl : that if a < 1 then both infinite sums converge. 
Suppose we have the SVD decomposition L = UHV'^ . Let 

where, A is a matrix such that (C5) is satisfied. As a result of our con- 
struction, we have Vn{Z — A) = AASign(^) and by optimality of {A,L), 
we have Vn{Jn) — AASign(j4). This entails that 7'q(A) = and hence 
(CI) is satisfied. 



To show (C3) holds, we need the next lemma. 
Lemma 4. VriJn) = Vt{\lUV^)- 

By our construction, we have Pr{Z — A) = V-ri^LUV^) — Vr{Jn) by 
Lemma |4] Consequently, VriA) ~ and hence (C3) is also satisfied, 
considering the oracle constraint bound pp. 



It suffices to show that (C2) and (C4) are satisfied with high probability. 
This has been shown in the next Lemma. 

Lemma 5. Under assumptions (A1)-(A5), Z satisfies conditions (C2) 
and (C4) with probability 1 — ci exp(— C2n) for some positive constants Ci 
and C2 ■ 

This concludes the proof of the theorem for the discrete time system. 



• STEP 4: Denote Xit) = [x{t) u{t)]^ and let 

r dt V^=\, f dw{t)X{tf. 

Having the result for the discrete time system, it suffices (see proof of 
Theorem LI in [3 for more details) to show that for a given continuous 
time system, there exists a discrete time system with Q*^"' and W'"-' such 
that almost surely, 

Q >v(») _^ >v, 

as n — >■ CX3 for a fixed T — nrj (and hence, 77 — )■ 0). 
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Let Q* be the matrix satisfying the continuous time Lyapunov stabiHty 
equation A*Q* + Q*A*'^ + 1 = and Q*{ri) be the matrix satisfying 
the discrete time Lyapunov stabiHty equation A*Q*{ri) + Q*{ri)A*'^ + 
r]A''Q*{r])A*'^ + 1 = 0. It is easy to see that Q*(r;) ^ Q* as 77 ^- by 
the uniqueness of the stationary distribution. Moreover, by Lemma |12[ 
we know that Q(") Q*{r]) a.s n 00. 

Now, let the initial state of the discrete time system be 

X{t = 0) = {Q*{7j))'/^{Q*)-'^'x{t = Q), 

and the noise w{i) = w(t = irj) — w{t = (i— I)??)- It can be easily checked 
that w{i) ~ Af{0,ril) if the continuous time w{t) is a Brownian motion. 
Thus, x{i) and x{t) are coupled and the almost sure convergence, follows 
from the convergence of random walks to Brownian motions |32j . This 
concludes the proof of the theorem for continuous time systems. 



6 Experimental Results 
6.1 Synthetic Data 

Motivated by the example discussed in the paper, we simulate a similar (but 
different) dynamic system for the purpose of our experiments. Consider the 
system where each latent variable is only evolves by itself, i.e., C* = and D* 
is a diagonal matrix. Moreover, assume that each observed variable is affected 
by exactly two latent variable, i.e., each column of B* has 2p/r non-zeros and 
each row of B* has two non-zeros. We randomly select a support of size s per 
row for A* and draw all the values of A* and B* i.i.d. standard Gaussian. 
To make the matrix A* negative definite (hence, stable), using Gersgorin disk 
theorem [TH|, we put a large-enough negative value on the diagonals of A* and 
D*. 

We generate the data according to the continuous time model. The solution 
to the first order system can be written as 

uit) 

where, e-^ = I + A* + \A*^ + ... is a generalization of the exponential function 
to matrices. We sub-sample this system at points ti = rji for i = 1,2, ... ,n, 
that is 

x{i) 

_ 

The stochastic integral can be estimated by binning the interval and assuming 
the Brownian motion is constant over the bin and hence, can be estimated by a 





x{to) 


-I 




u{to) 


Jt 



'tn 





x{i - 1) 






u{i - 1) 





Iri(i-l) 



dw{T) 
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Control Parameter 

(a) Effect of -q 




Control Parameter 

(b) Effect of r 




Control Parameter ©s 

(c) Effect of s 



Figure 1: Probability of success in recovering the true signed support of A* versus 
tlie control parameter O with p = 200, r = 10 and s = 20 for different values of rj in 
and, with p = 200, s = 
"(b)| and, with p = 200, 

Notice that Fig. 1(c) is plotted versus Q x s which means nri scales with 



1(a) 



1(c) 
73^ 



20 and rj = 0.01 for different number of latent time series 
r = 10 and fixed rj — 0.01 for different sparsity sizes s in 

not 



This means our theoretical result can be tightened. 
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standard Gaussian. For more information on this integration method, we refer 
to Chapter 4 of Shreve [5^ . 

Using this data, we solve Q using accelerated proximal gradient method 
[221 . Motivated by our Theorem, we plot our result with respect to the control 
parameter 9 = \o<y((I+2r)p+r^) ■ P^'^^ values of and by dividing 
the training data into chunks each having consecutive samples and do the cross 
validation over those chunks. Note that this is different from the standard cross 
validation technique due to the dependency of samples. 

Figure [5] shows the phase transition of the probability of success in recovering 
the exact sign support of the matrix A* . We ran three different experiments, 
each investigating the effect of one of the three key parameters of the system 
rj (sampling frequency), r (number of latent variables) and s (sparsity of the 
model). These three figures show that the probability of success curves line up 
if they are plotted versus the correct control parameter. The first two curves 
for rj and r line up versus O, indicating that our theorem suggests the correct 
scaling law for the sample complexity. However, from this experiment, it seems 
that the phase transition probability scales with not . Perhaps the result 
of our theorem and also Bento et al. ^ (for r — Q) can be tightened. 



6.2 Stock Market Data 

We take the end-of-the-day closing stock prices for 50 different companies in the 
period of May 17, 2010 - May 13, 2011 (255 business days). These companies 
(among them, Amazon, eBay, Pepsi, etc) are consumer goods companies traded 
either at NASDAQ or NYSE in USD. The data is collected from Google Finance 
website. Our goal is to observe the stock prices for a period of time and predict 
it for the entire days of the next month with small error. 

Applying our method and pure LASSO [3] to the data, we recover the struc- 
ture of the dependencies among stocks. We represent the result as a graph in 



Fig 6.2 where each company is a node in this graph and there is an edge between 
company i and j if Aij ^ 0. This result shows that the recovered dependency 
structure by our algorithm is order of magnitude sparser than the one recovered 
by pure LASSO. 

To show the usefulness of our algorithm for prediction purposes, we apply 
our algorithm to this data and try to learn the model using the data for n 
(consecutive) days and then compute the mean squared error in the prediction 
of the following month (25 business days). We randomly pick an starting day 
Hq between day 1 and day 255 — 25 — n. Then we learn the model using the 
data from the day to the day no + n (total of n days). Then, we test our 
data on the consecutive 25 days. Finally, we average the error over 10 different 
starting points ng for each value of n. We pick the regularizers by the semi- 
cross validation process explained in the previous section. The ratio || shows 
the ratio of training sample size to the testing sample size. 

Figure [3 (b) | shows the prediction error for both our method and pure LASSO 
[3] method as the train/test ratio increases. It can be seen that our method not 
only have better prediction, but also is more robust. Our algorithm requires 
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(a) Pure LASSO 




(b) Our Algorithm 



Figure 2: Comparison of the stock dependencies recovered by Pure LASSO ^ and 
our algorithm. 



only 3 months of the past data to give a robust estimation of the next month; 
in contrast with almost 6 months requirement of LASSO. However, the error 
of our algorithm is much smaller (by a factor of 6) than LASSO even in the 
steady state. Figure 3(a) shows the sparsity level for our model and the LASSO 
model. The number of latent variabl es ou r model finds varies from 8—12 for 
different train/test ratios. As Figure 3(a) illustrates, our estimated A is order 



of magnitude sparser than the one estimated by LASSO. 



A Proof of Lemma [T] 



Ignoring the term \\x{n + 1) — a^(n)||| which is independent of A, minimization 
of C{A) with this infinite sample size is equivalent to 



min E 

A 



x(nY A^ Ax{n) {x{n + 1) - x{n)Y Ax{n) 

V 



min E 

A 



trace {Ax{n)x{nfA^)-2 {A*x{n)+B*u{t)fAx{n) 
= nun trace (AQ* A^)-2trace (A*Q*A^)-2trace {B* R* A^' 

= nun trace (^(^A - 2 {A* + B* R* {Q*)-^)^Q* A^ 

Here we ignored the term w{n) due to the fact that it is zero mean and inde- 
pendent of x{n) and u{n). This implies that the asympotatic optimizer of £(•) 
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(a) Model Sparsity 
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(b) Prediction Error 

Figure 3: Prediction error and model sparsity versus the ratio of the training/testing 
sample sizes for prediction of the stock price. Prediction error is measured using mean 
squared error and the model sparsity is the number of non-zero entries divided by the 
size of A. 



satisfies A = A* + B*R*{Q*) ^ This concludes the proof of the lemma. 

B Illustrative Example 

In this section, we analyze the illustrative example discussed in Sec |4] For that 
example, Lyapunov stability equation requires 



-2Q* 



-B*R* +R*TB*t 



-2i?*^ + B* 
-2P* 



This entails that R* = \B*'^ and Q* = \{I + B*B*'^) with C,nin = \- It can 

B*B*'^). Thus, the low-rank matrix of 



p+r 



be easily checked that Q* ^ = 2{I — 

interest is 

L* = B*R*Q*-^ 



B*B*'^{I 
P 



B*B 



(1 



)B*B* 



Taking singular value decomposition U*J^*V* of this matrix, we get U* = V* = 
^J^B* and hence ^ — r. Considering s = 1, the identifiability assumption (A2) 

becomes a = < 1 or equivalently, r < 
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Considering assumption (A3), note that Qs^Sk — 1 is just an scalar since 
s = 1. Moreover, Qg^^g^^ is a vector with all entries equal to ^ and hence ^ ^. 

C Proof of Lemma [2] 

Suppose Dji = A — A and — L — L. From condition (C5), —Z is the 
subgradient of the loss function at (A, L) and hence, 

C{A + L)>C{A + L)-1^-Z,Da + Dl). (6) 

Let Za = AASign(l)-F with F = XASigniVn" {Da))- Notice that Vs^iiF) = 
and {F,Da) = Ayi||7'o<:(-Dyi)||i- For Za is in the subgradient of A^||A||i, we 
have 

Xa\\A\\i>Xa\\A\\i-{Za,Da). (7) 

Suppose L ~ UTiV'^ and Vr^iDL) — Ud^dVd are SVD decompositions. 
Now, letZL = \lU*V*^ + Wi-Wq with Wq = (1 - a)XLUDVD and Wi = 
'Pt{XlUV'^) - \lU*V*^. In this construction, we have 

(a) VriWo) = and \\Wq\\2 < (l-a)AL and {Wo,Dl) = (l-a)AL|lPr=(^L)IU- 

(b) Vr-={Wi) = and ||Wi||2 < 4poAl by Lemma|6] 

Let W2 = -Vr-i^LUV^) ~ and notice that Zl = XlUV'^ + M^z. Here, 
we have Vf^{W2) = and ||W^2||2 < A^ by Lemma [6[ Hence, our constructed 
Zl is in the subgradient of AL||i||*, i.e., 

Xl\\L\U>\l\\L\U-{Zl,Dl). (8) 
Combining ([6|-([8|, we get 

C{A + L) + Xa\\A\\i + Xl\\L\\* > CiA + L) + Xa\\A\\i + Xl\\L\\, 

+ (z. Da + Dl) - {Za. Da) - {Zl,Dl) ■ 

Provided (^Z, Da + Dl'^ — {Za, Da) — {Zl,Dl) > 0, we arrive to a contradiction 

with the optimality of {A, L) and the result follows. 

Notice that by first order optimality condition, we have Vq,{Z) — AyiSign(A) = 
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Vn^ZA)- Hence, for some 7 < 1, we have 

Z,Da) - {Za.Da) 

Vn^{Z),Vn^{DA)) + {F,Vn4DA)) + (Vn{Z) -Vn{ZA),Vn{DA) 



=0 by (CI) 

[rn^{Z),rn^iDA)) + \A\\Vir-iDA)\\i (Construction of i^^) 

> -lXA\\Vn4DA)\\i + XA\\Vn4DA)\\i (by (C2)) 
= {1 ~ 7)XA\\rn4DA)\\i- 

Similarly, by first order optimality condition, 'Pf{Z) = Xi^UV^ and by our 
construction, VriZ) = XlU*V*'^ + Wi= VriZh)- Hence, we get 



Z,Dl) -{Zl.Dl) 

= {Vt^{Z),Vt^[Dl)) + {Wo,Vt^[Dl)) + {Vt{Z) - Vt[Zl),Vt{Dl) 



=0 by (C3) 

'^Vt-={Z),Vt'{Dl)) + (1 - a)\L\\VT''{DL)\\* (Construction of Wo) 

> -7(1 - a)XL\\VT^{DL)\U + (1 - a)XL\\VT^iDL)\U (by (C4)) 
= {l-j){l^a)XL\\rT4DL)\U. 

(10) 



Combining ^ and (10 1, we get 

'z,Da + Dl)-{Za,Da)-{Zl,Dl)>0. 
This concludes the proof of the lemma. 



Lemma 6. For Wi and W2 constructed above, we have IIW2II2 < and 
WWih < 4.pXL. 

Proof. First, notice that for aU M, \\'Pt{M)\\^ < 2 |lAf|l2 and hence, 
\\Zl\\2= VTiXLUV^)-Wo ^ (Construction of 1^2) 



< 

< 2 



VriXLUV^ 



XlUV' 



2 

Wo\ 



(Triangle Inequality) 



Wo II 2 (Projection Properties) 



< (3-q)A 
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Using this, we can bound both Wi and VF2. For W2, we have 



2II2 



< 



< 



Vr-iXLUV^ - XlU*V*'' - Wi] 
XlUV^ - XlU*V*^ - Wi 



2 (Triangle Inequahty) 

'_^ + ||Wo||2 (Null Space of T'^) 
Wo II 2 (Projection Properties) 
= ||7'^(Zi) -7'r(^L)||2 + lll^olls (Construction) 

< Po II-^l||2 + 11^0 II 2 (Oracle Constraint) 

< {{3-a)po + {l~a))XL < Xl. 

Since Vf{ZL) — X^UV^ , we can establish 
VriXLUV^) - XlU*V*'^ 



XlUV'^ - XlU*V*^ 



< \\Vr{ZL)-Vf{ZL)\\^ 

< po WZ^W^ + PqXl (Oracle Constraint) 

< ((3 - a)pa + p) Xl- 

This concludes the proof of the lemma. 

D Proof of Lemma [3] 



(Triangle inequality) 



General Notation: For a matrix X € 



pax6 



we use 



□ 



, to de- 



note rows, Xl, . . . , X}, to denote columns and x\^\^ .... X^'' to denote entries. 
Also, for the sets of indecies Si C {!,• • -,0} and 52 C {!,• • -,6}, the matrix 
-^^5152 ^ RI'^iI^I'^^I represents the sub-matrix of X consisting of the rows and 
columns corresponding to index sets Si and 52. 

We prove part (b) of the lemma. By triangle inequality, we have 



LL-L* <\\Vf{L- L*)~Vt{L- L") 



\Vf{L*)- Vt{L*)\\^+ \\Pt{L) - V\ 



< po \\L - L'W +P0 ||L*||2 + ■Pr(^) - ^1 



(Oracle Constraint) 



< po |l - L'W^ + po ||L*||2 + |pr(t/V^) - f/"l/*'^|_^ Ir " -^IL 



< po ||L - L*||_^ + po ||L*||2 + 4po IP - I/'ll^ 



by (C3). 



Hence, 



po 



II II2 1 — 5po 



(SVD) 



(11) 
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Let Q(") = iJ27=ixiiMif and = TiJ27=iU{i)x{i)'^ . Substituting 

x{i + l)-x{i) = riA*x{i) +riB*u{i) +w{i) and L* = B*R*{Q*)-'^ in (C5), we 
get 



1 " 

(12) 



We can rewrite this equation as 

Vn4L - l*)q("' + (I - A* - L*))g(") - - + Z = 0. 

(13) 

Let us only focus on the fc"* row of the system of equations (12). We can 
break down ( 12 ) on the fc*'' row into two sets of hnear equations as follows: 



^(fc)^(") 



r{n) 
'S? 



(14) 



From the first line, we get 
By Lemma [Sj we have 



< 1- 



kPne(L-L*) <\\L-L*\ 



Thus, by Lemma [9] and (CI), we get 



A- A* 



< 



< 



1 - 5po 
2po 



'2 C . 



l-5po ' C,„i„(4-0) 



(Lemmas 10|11 ) 



(15) 



The last inequality follows from the definition of pQ. This concludes the proof 
of the lemma. 

Lemma 7 (Convex Optimality). If A is a solution of Q then there exists 
a matrix Z G M^'^p, called dual variable, such that Z e Ayii9||j4||i and Z G 
Al9||L||* and 



1 " 

— V fx(i + 1) - x(i) - 7j(A + L)x(i)) x(if + Z = 0. 
rjn ^ \ J 



(16) 
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Proof. The proof follows from the standard first order optimality argument. 

□ 



E Proof of Lemma [4] 

The result follows from our construction of Wq, W\ and W2 in the proof of 
Lemma [ij With our dual construction, we have Vf{Jn) = XlUV'^ and hence, 
J„ — XlUV'^ + W2 and by construction, J„ = Vt{XlUV^) + Wo which entails 
^t(^»i) = 'Pt{XlUV^). This concludes the proof of the lemma. 



F Proof of Lemma [5] 



Substituting {A — A* + L — L* )]^^ from the first equation in the second in ( 14 ) 
we get 



By triangle inequality, we get 



St ^SfA' ^^3^'+ "^sl'^ ][Qs%^ Q Sust- 



ain) 



\Vn<={Z)\\ <max 



+ \\Y 



+ max 
fe 

2po 



\W 



_^ I + A. 



(n) 



, , ,1+-^ 2?n.ax||L*||2 (Lemma 141 



4(4-61) 4(4- 



(Lemmas 10|11 1 



(Le 



j|10|ll[ ) 



2po 



1 + 



< 1 



1 — 5po V Cmi 

(1-a) 



1-^)A. 



Aa. 



Hence, condition (C2) is satisfied. 
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To show (C4) also holds, notice that from ([l3|, we have 



< 



< 



2(4 - e) 



The last inequality follows from Lemmas 10 



and 



11 



and the fact that Q^") on 
the support is invertible for the given sample complexity due to Lemma [9j 

Next, notice that L* = B* R* {Q*)~^ and hence the row-space of L* is the 
column/row space of Q* and consequently, for any matrix G T, we have 
'Pt'^{FQ*) ~ 0. Thus, by triangle inequality, we have 



||pr<= ((I+L- A* -L*)g("))||^ 
< \\{A + L-A'-L') fg*"' -Q* 



+ \\Tt- {A + L- a* - L* 



M + L — A* — L* ||Q*||2Vp (Projection Propert 



rs\\A- A'W + L-L* 



\/p llQ*"' ^ + 2^max ) (Triangle Inequality). 



Finally, from (15), (11) and Lemma 13 we get 



I II 2 V 9S\/S 



C„.i„(4-e) ^^^+2(4^ 



(By (A4-2)). 



Hence, condition (C4) is also satisfied. This concludes the proof of the 
lemma. 



G Concentration Results 

In this section we prove the concentration results used throughout the paper. 
Before, we state the results, we want to introduce some useful notations and 
inequalities used to get the results. By the dynamics of the system, we have 



Xi^) 



x{i) 



= iI + rjA*y 



x{0) 
u(0) 

xio) 



wil). 



1=0 



Lemma 8. Under assumptions (A3) and (A5), for any S C {1, 2, . . . ,p} with 
\S\ < s, with high probability we have 



< 1 
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Proof. Using Lemma [9j it can be shown (see Lemma A.l in ^ for example) 
that 



^s^s \ ^ss 



<\\q*s^s{Qss) 1 



3|5k/LS 



I CXD , 1 



12 
''min 



Cmin II Hoc C. 

The result follows from Lemma [12] This concludes the proof of the lemma. 



□ 



Lemma 9. Under assumption (A5), for any S C {l,2,...,p} with \S\ < s, 
with high probability, we have 



AniiT 



> 



Proof. By the Courant-Fischer variational representation [22] , we have 

^min ^2^5^ — ^min (Q-Ss) ~ ^max (^Q^^ ~ ^55 

Q* - Q("^ 



>Cn 



The last inequality follows from Lemma 12 This concludes the proof of the 
lemma. 

□ 

Lemma 10. Under assumptions (A4) and (A5), with high probability, we have 

OXa 



< 



4(4- 



Proof. Let X(i) — [x{i) u{i)] . According to the dynamics of the system, we 
have 



-1 ft J- 

>v(») wii) x{of ((/ + ijA*yy 



n — 1 i—1 



i=l 1=0 



E2{i) 

We bound these two terms separately. Notice that w{i) is distributed Af{Q,riI) 
independent of x{Q) and w(j)'s. Given x{0), we have 
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By stability assumption, we have {Ei{i)'^''^) < (|la;(0)|li + |1m(0)|1^)S^*^^ and 
hence, 

/ n — l \ ^ n— 1 

- E < ^ VAR {wi^),E, 



i=0 



< 



1-^(0)111 + 11^(0)11^ 
r]n{l - S2„^^) 



Consequently, by standard concentration of Gaussian random variables and 
union bound, we get 



ji-i 



Tjn — ' 



i=0 



> e 



p p 

j=i k=i 



ri-l 



-E-(*).^i(*)^'^ 



< 2 exp 

With similar analysis, we get 



i=0 

2i\\xml + \\u{0)g) 



> e 



rjn + log((s + 2r)p + r^)j . 



( n — l \ n—l 

- E < ^ E VAR {wi^hE,i^ 

' i=0 / ' i=0 



< 



r]n{l - S^ax)' 



The last inequality follows from the concentration of random variables |27] 
in particular. 



i^Eii-(o.^^(o('^'iii>^ 



Finally, we get 



1 

-E^»^2(*) 



> e 



y2 
max 



P P 

^EEi 

j = l A;=l 



< exp ( ^^V^ + log((s + 2^)^ + r^) 



-E^«^-^2(*)('=^ 

rjn ^ 



> e 



< 2 exp (-^ — ^^L^jjn + \og{{s + 2r)p + r^) 
V 2(V^+1) 

The result follows for e = g^^^''^^^ . This concludes the proof of the lemma. 

□ 

Lemma 11. Under assumptions (A4) and (A5), with high probability, we have 



< 

oo 4(4 - 
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Proof. We can establish 

^ B* fi?(") - R*) +B*R*{Q*r^ (q* - Q^") 



We bound these two terms separately. For the first term, we have 



< \\B* 



Q* - Q(") 



For the second term, we have 

B*R*{Q*)-Uq* ^Q^^'A <\\B*R*{Q*)~^\\ ^ 



Q*-Q 



(n) 



< \\B*\\^^,a,^,.{R*{Q*)-'^ 



< IIS* 



\oo.l n 



Q* ~Q 



in) 



The result follows from Lemma 13 This concludes the proof of the lemma. 

□ 

Lemma 12. Under assumption (A5) , with high probability, we have 



< 



Proof. Let X{i) = [x{i) u{i)f . Let ^i{i) = E [X{i)] (clearly, n{oo) = 0). We 
have 

^n— 1 n— 1 

n •'^ — ' n ^ — ' L J 

i=0 1=0 

" V ' ^ V ' 

El 



B2 



We bound these three terms, separately. For the first term, we have 



n— 1 n — 1 

- E A^WM^f < - E ^max (ll^(0)|l2 + h(0)|| 
1=0 oo 1=0 



< 



i=0 

1^(0)112 + Il"(0)ll2 ^ 
ri(l — S^ax) 
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For the second term, notice that by independency assumption on w, we have 

n-l 



- It J. 

-Y,E[iX{^)-|,i^)){Xi^)^^i^))' 



n-l i-1 
i=0 1=0 
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= ^ 

n 

= V 



n-l 



Y^(l-iI + rjA*f) (l-il + rjA*) 



i=0 

n-l 



I-{I + vA*f + -{I + vA*f' ] (I-{I + vA*y 
n n 



On the other hand, we have 



hm E 



[{X{^)-^,{^)){X{^)-^,{^)f 
= hm 77 (/ + ryyl*)^' 

1=0 

= lim^Tj (/-(/ + 77-4*)'') (/-(/ + vA*f) 

= ri[l~{I + vA*fy\ 

In the above inequahties, we interchanged hmit and expectation as a result of 
Gaussianity assumption and the stability of the system. Finally we get 



77 II ^ ^(-'- ^n"ax) ^ 



To bound the third term, notice that 



^ n — l 71 — 1 

" i=0 j=0 



n- j 1 



n— j — 1 



n n — ] 



((/ + ,,^*)^f . 



By Lemma 1 in |34j , we have 



m-vi\L> 



< 4exp 



2 

en 



3200r7(n - j) 



n- J 

Consequently, we get 



< 4exp 



. + log ((s + 2r)p + : 



3200?7(n - j) 



. + log((s + 2r)p + r2)^ . 
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Thus, we conclude 



1 



< 4exp 



^ + log {{s + 2r)p + r^)^ 



32OO77 

We want this probabihty to be less than S. Putting all thre parts together, we 
get 



Q*-Q 



(n) 



< 



1 



77(1 -E 



2 ^-l 
max / 



1^(0)112 + ll«(o)||^ 



1 — y2 

max 



(17) 

For m/ > (^D'^ + \\x{0)\\l + \\uiO)\\l^ and e = "f/f — , the resuh fol- 

lows, provide that the probabilities go to zero, i.e.. 



6 „3 



nr] 



> 



3 X 10'^ s 
1)2 6*2 CL 



log 



4((s + 2r)p + r2) 



For large enough values of p, this lower bound dominates the earlier lower bound 
of 7177, hence, we ignore that one. This concludes the proof of the lemma. 

□ 

Lemma 13. Under assumptions (A4) and (A5), with high probability, we have 



Q*-Q 



in) 



0Xa 



^4(4-0) ||S*||_^(^ + 1 



Proof. According to (17), the result follows if e = 
suming p is large enough. 



0\a D 



8(4-9)11 



as- 



□ 



Lemma 14. For sample complexity 
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3 X 106 (I?„,ax + 2C^in) , /4((s + 2r)p + r^) 



£12 {n 

with high probability, we have 



max I *-^miii 



log 



<2(l + ^)2?„_. 



Proof. Since Q* and Q*-"^ are positive semi-definite matrices and 



IS** 



The resuh directly follows from Theorem in [35] for e := - Q*|loo 

2 



4(-D°"'''+2c'" ) considering the fact that || 5**11 2 < Anax ( 1 + "f' 



□ 
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